multiple speaker
Representation of perceived prosodic similarity of conversational feedback
Qian, Livia, Figueroa, Carol, Skantze, Gabriel
Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Sweden (0.04)
- Europe > Germany (0.04)
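To make the comparison described in the abstract above concrete, here is a minimal sketch of ranking a feedback triad by pairwise cosine similarity of mean-pooled speech representations. The random arrays stand in for frame-level features from a spectral or self-supervised encoder; this is an illustrative sketch, not the authors' actual pipeline.

```python
import numpy as np

def pool(frames: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level features (T, D) into a single utterance vector (D,)."""
    return frames.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def odd_one_out(triad_frames: list[np.ndarray]) -> int:
    """Given three feedback tokens (frame-level features), return the index of the
    token least similar to the other two -- the judgement a listener makes in a
    triadic comparison task."""
    vecs = [pool(f) for f in triad_frames]
    sims = [cosine(vecs[1], vecs[2]),   # similarity of the pair excluding token 0
            cosine(vecs[0], vecs[2]),   # ... excluding token 1
            cosine(vecs[0], vecs[1])]   # ... excluding token 2
    return int(np.argmax(sims))         # the token excluded from the most similar pair

# Example with random stand-ins for encoder features (e.g. one layer of a self-supervised model):
rng = np.random.default_rng(0)
triad = [rng.normal(size=(120, 768)) for _ in range(3)]
print(odd_one_out(triad))
```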
Building a Luganda Text-to-Speech Model From Crowdsourced Data
Kagumire, Sulaiman, Katumba, Andrew, Nakatumba-Nabende, Joyce, Quinn, John
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20 and 49. Although the generated speech is intelligible, it is still of lower quality than that of a model trained on studio-grade recordings. This is due to the insufficient data preprocessing applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can be improved by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to retain only recordings with an estimated MOS above 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
- Africa > Uganda > Central Region > Kampala (0.05)
- Africa > East Africa (0.04)
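A rough sketch of the clip-level preprocessing the abstract above describes: trim leading and trailing silence, denoise, and keep only clips whose estimated MOS exceeds 3.5. The enhancement and MOS models are left as placeholders, since the abstract does not name specific ones.

```python
import librosa
import soundfile as sf

MOS_THRESHOLD = 3.5  # keep only clips rated above this estimated quality

def enhance(y, sr):
    """Placeholder for a pre-trained speech-enhancement model (not specified in the abstract)."""
    return y

def estimate_mos(y, sr) -> float:
    """Placeholder for a non-intrusive, self-supervised MOS predictor."""
    return 4.0

def preprocess(in_path: str, out_path: str) -> bool:
    y, sr = librosa.load(in_path, sr=None)
    y, _ = librosa.effects.trim(y, top_db=30)   # drop silent portions at the start and end
    y = enhance(y, sr)                           # reduce background noise
    if estimate_mos(y, sr) <= MOS_THRESHOLD:     # discard clips with low perceived quality
        return False
    sf.write(out_path, y, sr)
    return True
```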
Detecting Check-Worthy Claims in Political Debates, Speeches, and Interviews Using Audio Data
Ivanov, Petar, Koychev, Ivan, Hardalov, Momchil, Nakov, Preslav
A large portion of society united around the same vision and ideas carries enormous energy. That is precisely what political figures would like to accumulate for their cause. With this goal in mind, they can sometimes resort to distorting or hiding the truth, unintentionally or on purpose, which opens the door for misinformation and disinformation. Tools for automatic detection of check-worthy claims would be of great help to moderators of debates, journalists, and fact-checking organizations. While previous work on detecting check-worthy claims has focused on text, here we explore the utility of the audio signal as an additional information source. We create a new multimodal dataset (text and audio in English) containing 48 hours of speech. Our evaluation results show that the audio modality together with text yields improvements over text alone in the case of multiple speakers. Moreover, an audio-only model could outperform a text-only one for a single speaker.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > Arizona > Maricopa County > Scottsdale (0.04)
- (3 more...)
- Media > News (1.00)
- Government > Regional Government > North America Government > United States Government (0.47)
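As an illustration of combining the two modalities discussed in the abstract above, here is a minimal late-fusion sketch: concatenate a text embedding and an audio embedding and classify check-worthiness. This is a generic baseline with arbitrary dimensions, not the authors' model.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late fusion of a text embedding and an audio embedding for binary
    check-worthiness classification (generic baseline, not the paper's model)."""
    def __init__(self, text_dim: int = 768, audio_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # check-worthy vs. not check-worthy
        )

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([text_emb, audio_emb], dim=-1))

# Usage with stand-in sentence and audio embeddings:
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```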
Tech Talk: How AI Is Serving the Restaurant Industry
As Chief Revenue Officer at HungerRush, Olivier Thierry is helping shape customer expectations around AI as the restaurant industry begins experimenting with it, he tells Spiceworks News & Insights' Technology Editor Neha Kulkarni. Restaurants have realized that adopting new technology will help them not only survive their challenges but also achieve results, he notes. From labor shortages to improving the customer experience, in this edition of Tech Talk, Olivier discusses how AI can overcome these challenges and allow restaurants to reduce human error. He also shares how natural language processing can interpret customer attitudes in phone orders and has a real place in understanding the customer experience. Olivier: The pandemic turned the restaurant industry upside down, and many of its setbacks are still being felt today.
Multi-speaker Text To Speech
Speech synthesis (text-to-speech, TTS) is the generation of a speech signal from written text; in a sense, it is the inverse of speech recognition. Speech synthesis is used in medicine, dialogue systems, voice assistants, and many other business applications. As long as there is a single speaker, the task of speech synthesis looks fairly straightforward at first glance. When several speakers come into play, the situation becomes more complicated and related tasks arise, such as voice cloning and voice conversion; these will be discussed further in the text.
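To make the multi-speaker setting concrete before going further: one common approach (a sketch under that assumption, not necessarily the exact design discussed later in the text) is to condition the acoustic model on a learned speaker embedding, for example by adding it to the text-encoder states so that one model can synthesize many voices.

```python
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    """Toy illustration: a text encoder whose output is conditioned on a
    per-speaker embedding, so a single model can produce several voices."""
    def __init__(self, vocab: int = 100, dim: int = 256, n_speakers: int = 6):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        x = self.text_emb(tokens)                          # (B, T, dim)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)  # broadcast speaker identity over time
        out, _ = self.rnn(x)                               # a downstream decoder would predict a spectrogram
        return out

enc = MultiSpeakerEncoder()
hidden = enc(torch.randint(0, 100, (2, 20)), torch.tensor([0, 3]))
print(hidden.shape)  # torch.Size([2, 20, 256])
```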
Text to Speech System for Multi-Speaker Setting
What would you want to do if you could generate the voice of your favorite celebrity? Before I get ahead of myself, let me clearly define the objective of this blog. Given text and some voice clips of the desired speaker (say, Beyonce), I want my AI to output an audio clip of Beyonce speaking the text that I input to the code. So essentially, this is the same Text To Speech (TTS) problem we saw earlier, but with an added constraint: the speech must be produced in a particular speaker's voice. In this blog, I share two methods that can accomplish this task and compare them at the end.
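Before comparing specific methods, here is a hedged sketch of the general recipe this kind of voice cloning tends to follow: derive a fixed-length speaker embedding from the reference clips and condition a multi-speaker TTS model on it. `embed_utterance` and `tts_synthesize` are placeholders for pre-trained models, not the particular systems discussed in this post.

```python
import numpy as np

def embed_utterance(wav: np.ndarray) -> np.ndarray:
    """Placeholder for a pre-trained speaker encoder (e.g. a d-vector/x-vector model)."""
    return np.random.default_rng(0).normal(size=256)

def tts_synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Placeholder for a multi-speaker TTS model conditioned on a speaker embedding."""
    return np.zeros(22050)

def clone_voice(text: str, reference_clips: list[np.ndarray]) -> np.ndarray:
    # Average the per-clip embeddings into a single voice "signature",
    # then synthesize the input text in that voice.
    embeddings = np.stack([embed_utterance(clip) for clip in reference_clips])
    voice = embeddings.mean(axis=0)
    voice /= np.linalg.norm(voice) + 1e-9
    return tts_synthesize(text, voice)
```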
Voice 'Fingerprint' Propels Speaker Recognition
The accuracy of automatic speech recognition has made significant gains in the last few years thanks to the advent of deep neural networks. But there's one area that has thwarted researchers: telling multiple speakers apart. Now a startup called Chorus says it has made a breakthrough in the matter through a technique it calls "voice fingerprinting." Speech recognition and computer vision arguably are the two computational challenges that have benefited the most from deep learning. Armed with huge training sets – including vast troves of photographs and digital recordings of voices – convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have given computers sensory perception that can almost rival humans' senses.
- North America > United States > California > San Francisco County > San Francisco (0.05)
- Asia > Middle East > Israel (0.05)
How to listen to live baseball games on an Amazon Echo
Now that baseball season is underway, one of the easiest ways to listen to the games is on an Amazon Echo or another Alexa device. With TuneIn Live or MLB At Bat, you can stream live broadcasts from any Major League Baseball game using voice commands. TuneIn's service costs $3 per month for Amazon Prime subscribers (or $4 per month for non-subscribers) and also includes news and live sports from other leagues. MLB's Gameday Audio service costs a one-time payment of $20 for the entire 2018 season. Audio streams are also included with an MLB TV Premium subscription, which offers live video broadcasts for out-of-market games and costs $25 per month or $116 for the season.
An AI has learned how to pick a single voice out of a crowd
Devices like Amazon's Echo and Google Home can usually deal with requests from a lone person, but like us they struggle in situations such as a noisy cocktail party, where several people are speaking at once. Now an AI that is able to separate the voices of multiple speakers in real time promises to give automatic speech recognition a big boost, and could soon find its way into an elevator near you. The technology, developed by researchers at the Mitsubishi Electric Research Laboratory in Cambridge, Massachusetts, was demonstrated in public for the first time at this month's Combined Exhibition of Advanced Technologies show in Tokyo. It uses a machine learning technique the team calls "deep clustering" to identify unique features in the "voiceprints" of multiple speakers. It then groups the distinct features from each speaker's voice together, allowing it to disentangle multiple voices and then reconstruct what each person was saying.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.27)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.27)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.96)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.59)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.59)
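The deep clustering idea described in the article above can be sketched as: map each time-frequency bin of the mixture spectrogram to an embedding, cluster the embeddings with one cluster per speaker, and use the resulting binary masks to pull each voice back out. The toy version below replaces the learned embedding network with a fixed random projection, purely to show the masking-by-clustering mechanics rather than the team's actual model.

```python
import numpy as np
from sklearn.cluster import KMeans

def separate(mixture_spec: np.ndarray, n_speakers: int = 2, emb_dim: int = 20):
    """Toy deep-clustering-style separation.

    mixture_spec: magnitude spectrogram of the mixture, shape (T, F).
    Returns one masked spectrogram per speaker. A trained network would produce
    the per-bin embeddings; here a fixed random projection stands in for it.
    """
    T, F = mixture_spec.shape
    rng = np.random.default_rng(0)
    project = rng.normal(size=(1, emb_dim))              # placeholder for the learned embedder
    embeddings = mixture_spec.reshape(-1, 1) @ project   # one embedding per time-frequency bin
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)
    masks = [(labels == k).reshape(T, F) for k in range(n_speakers)]
    return [mixture_spec * m for m in masks]             # apply each speaker's binary mask

sources = separate(np.abs(np.random.default_rng(1).normal(size=(100, 257))))
print([s.shape for s in sources])
```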